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Automated machine learning is a promising approach widely used to solve 
classification and prediction problems, which currently receives much 
attention for modification and improvement. One of the progressing works 
for automated machine learning improvement is the inclusion of evolutionary 
algorithm such as Genetic Programming. The function of Genetic 
Programming is to optimize the best combination of solutions from the 
possible pipelines of machine learning modelling, including selection of 
algorithms and parameters optimization of the selected algorithm. As a 
family of evolutionary based algorithm, the effectiveness of Genetic 
Programming in providing the best machine learning pipelines for a given 
problem or dataset is substantially depending on the algorithm 
parameterizations including the mutation and crossover rates. This paper 
presents the effect of different pairs of mutation and crossover rates on the 
automated machine learning performances that tested on different types of 
datasets. The finding can be used to support the theory that higher crossover 
rates used to improve the algorithm accuracy score while lower crossover 
rates may cause the algorithm to converge at earlier stage. 
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1. INTRODUCTION 


Despite the notable impact of machine learning techniques in a wide area of application 
domains [1-4], one major issue that emerged since the beginning of the application deployment is complexity 
of design [5-7]. In order to achieve state-of-the-art performance, machine learning techniques require 
cautious steps of data pre-processing and hyper-parameter tuning. Moreover, given with the ever-increasing 
number of machine learning models being developed, model selection is becoming increasingly important. 
With the demands from inexperienced data scientist to use machine learning models, automating the 
algorithm selection and optimizing the hyper-parameters tuning have been among of the preferred 
approaches of the data sciences community. Automated Machine Learning (AML) is the current approach 
proved beneficial to enexperienced data scientist and in some cases offered better performances than the 
conventional or manual approach [4-5]. 
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Currently, AML has been given a wide attention by reseachers and major of works were directed to 
use meta-heuristics approach to optimize the machine learning pipelines [7-10]. Pipelines is a collection of 
machine learning models pooled from the pre-data processing to the hyper-parameters tuning. 
Examples of meta-heristics search algorithms in AML are Particle Swarm Optimization [11], 
Genetic Programming (GP) [12] and Bayesian [13-15]. Depending on the computational availability, 
the process to complete the minimum function of AML optimization is considerable costly. This is due to the 
complex proceses of machine learning algorithms that requires a model trained and performances evaluation 
with the set of hyper-parameters. The evaluation time will be extended when the dataset is not well 
constructed in such of excessive features selections or inappropriate imputation of missing values. 
Nevertheless, by understanding how the the hyper-parameters works in an AML, the complexities and 
completion time to complete might be effectively gained. 

This paper presents the description of AML that used pipelines optimizations through its formal 
definition and conceptual explanation. Additionally, by focusing on GP based AML, the roles of some GP 
hyper-parameters namely mutation and crossover rates were also presented in this paper supported with the 
finding of empirical experiments that tested on variation sizes of dataset. To the best of our knowledge, there 
is very limited literature can be found on discussing and explaining detail of GP hyper-parameters in AML. 
Moreover, few claims were reported on the setting of the two important of control parameters 
(mutation and crossover) in the GP optimization of automated machine learning. The finding provides 
additional fundamental knowledge to the design and development of different setting of mutation and 
crossover rates for the AML based GP. 


2. RESEARCH BACKGROUND 
2.1. Formal definition of machine learning pipeline optimization in AML 
The formal definition of AML with search optimization is as the following: 

1. The objective is to minimizes some predefined loss function L(x ). Me 1 where M is the training 
model tested on the given validation dataset X® with learning algorithm A that learn the training 
dataset X pattern. 

2. The training and validation dataset X (t) X®) consists a set of features vector to be selected, 
pre-processed and well-constructed based on some featurization mechanisms. 

3. The learning algorithm A can be a single machine learning or a hybridization different algorithm that 
need to be configured with a set of hyperparameters H/;—1.n] in such in the following (1). 


M = (Ajeo(X“;Hi;), where Ajen(X “2;H;,); A) (1) 


4. Therefore, the goal of the AML search is to find a set of hyperparameters A* that yield an optimal model 
M* which minimizes the L(x ).M , function, formulated as the following (2). 


At = argamin L(X™; (Ajo (X“?; Hi); Aicn(X; Hi); A) 
= arga minF (A; (Ayoja(X; Hy); Aicn(X; Hy)),£) (2) 


The optimized pipeline is a set of machine learning model with the best selection of machine 
learning algorithm and the best optimal of hyper-parameters setting. 


2.2. Genetic programming based AML 

AML with GP is the interest of study reported in this paper, hence AML TPOT Python library has 
been used as the experimental tool [12, 16]. AML TPOT has capability to provide the best machine learning 
model on a training dataset from a series of processes called featurization, algorithm selection and parameter 
optimization as presented in Figure 1. 

Featurization is the process of selecting the best possible of features from the dataset through the 
Principle Component Analysis (PCA) and polynomial approaches. It also incorporated with a set of 
imputation strategies in handling the missing values of the training dataset. The best optimized features will 
be passed to model selections modules that supported with various kinds of machine learning algorithms 
(more than fifty of classifiers and regressors). The selected algorithm or can be a hybridized model will be 
passed through for the best optimized parameters configurations. AML TPOT used Genetic Programming 
(GP) optimization approach in selecting the best model gained from the combinations of pipelines 
(featurization, model selections, parameters optimization) in the algorithm selection pool. 
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Figure 1. AML TPOT main functions 


As a family of evolutionary algorithm, the performance of GP is highly depending on the parameters 
configuration mainly population sizing, mutation and crossover rates. Mutation and crossover are two 
essential elements in an evolutionary based algorithm [17-19] like GP that can influence the search 
optimization landscape by means of exploitation and exploration. The optimization of best optimal result 
needs a balance between both exploitation and exploration [20]. Exploration can eliminate the search to 
converge prematurely by avoiding the solution from trapping in local optima while exploitation focuses to an 
intensity search direction so that the best potential solution can be converged at the global optima [21]. 
Mutation is highly related to the exploration aspect to control the search diversity, often be defined with a 
very small probability rate to produce a little modification to an individual (chromosome). On the other hand, 
crossover is used to identify the starting point of the best optimal solution at global optima and furtherly 
drives the search direction in achieving the best solution at local optima. Therefore, crossover is generally 
thought to enable the exploitation aspect. In simpler words, the purpose of mutation is to search locally 
around the promising areas found by the crossover operator. 

Generally, the most appropriate mutation and crossover settings required by an evolutionary 
algorithm like GP to solve different optimization problems can be different [22-24]. In addition, evidences 
show that even for one specific optimization problem, the required best strategies for the both operators may 
vary during the evolutionary process [25-26]. 


3. RESEARCH METHOD 
3.1. Experiments and parameters configuration 

In this study, the experiment codes were developed with Pyhton 3.6 Jupyter Notebook platform and 
were implemented in the processor of Intel 17 7th Generation processor, 16 GB RAM. 

The AML TPOT machine learning model has been set to employ 75:25 percent ratio between the 
training and cross validation for all the experiments. TPOT used cross validation for the machine learning 
training approach and the number of default value for the training and validation chunks is defined as 5. 
Therefore, for each population size p, the number of iterations for selecting the best pipelines is equivalent to 
5 * p times. If the number of population size is 10, the GP iteration to finding the best pipelines is 50 times. 
A longer time will be required to complete if the number of iterations increased according to the population 
size and generation setting. Table 1 lists the parameters used in this research. 

The number of iterations for the GP to set the final optimized pipelines is determined by the number 
of generations in such that the algorithm should work better with more generations. Nevertheless, longer time 
would be required to complete with bigger number of generation and 5 is the common setting used in many 
experiments. Population size is the number of individuals or potential pipelines that can be stored in the GP 
selection pool. Our previous experiments found that the default population size 10 is fit enough to the 
algorithm to produce best optimal result. The offspring_size by default is none so that it will be equal to the 
population_size. Mutation rate is a small value used to control the GP exploration search by applying random 
changes to some of the pipelines in the selection pool. The crossover furthermore is used to tell the GP how 
many pipelines to reproduced at each of the generation. The values for both must be in between 0.1 to 0.9 
and not exceed 1.0. The mutation and crossover rates setting are divided into three sets of pair. First pair is 
MCRI1 and MCR2, which the range is between 0.1 and 0.9, second pair is MCR3 and MCR 4 within 0.3 and 
0.7 and lastly is moderate mutation and crossover rates (0.5,0.5). It is anticipated in this study that higher 
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crossover rates in MCR1 and MCR3 can help the algorithm to achieve better optimal results while lower 
crossover rates in MCR2 and MCR4 may affect on shortening the completion time. 


Table 1. The important parameters 


Parameters Configured value 
generations 5: 
population_size 10 
offspring_size none 
Mutation & Crossover rates Divided into 5 groups: 


MCRI= MR 0.1, CR 0.9 

MCR2 = MR 0.9, CR 0.1 

MCR3 = MR 0.3, CR 0.7 

MCR4 = MR 0.7, CR 0.3 

MCRS = MR 0.5, CR 0.5 
*MCR=mutation and crossover rates, MR=mutation rate, CR=crossover rate 


3.2. Datasets 

The evaluation of this study has been conducted on a collection of 8 datasets with different sizes 
downloaded from Kaggle website (www.kaggle.com) as listed in the following Table 2. 

Although the dataset sizes will not the major factor of machine learning performances, interesting 
information might be inferred when using dataset variations on testing the AML TPOT with different 
parameterizations of mutation and crossover. 


Table 2. Datasets 


Dataset Abbreviation Size of records 

Dota2 game Dota 90000 

Human_Resources_Employee_Attrition HREA 14999 
Orange_Telecom_Churn_Data OTCD 5000 
Heart attack Heart 300 

Fraud transaction FT 150870 

Adult Income Al 32000 
Diabetic Diabetic 1151 
Breast cancer Breastcancer 900 


4. RESULTS AND DISCUSSION 

In this section, the result of accuracy score and time to complete (TOC) of the algorithm with the 
five mutation and crossover rates is explained according to the data sizes : small and big. Heart, breastcancer, 
diabetic and OTCD were grouped as a small dataset while on the Breastcancer and OTCD datasets, 
the different setting of mutation and crossover rates did not present a much difference to the accuracy score. 
Different shapes of results can be seen on the Heart and Diabetics dataset. The highest score on Heart 
provided by MCR2 (0.9,0.1) while MCRS (0.3,0.7) is the best setting on Diabetic. 


4.1. Accuracy score 

Figure 2 presents the results of accuracy score for the AML TPOT tested on the datasets with sizes 
less than 10000. The default configuration MCR1 (0.1,0.9) of the AML TPOT remain as the optimal setting 
for two datasets (Heart, Diabetic) but not much effect on the Breastcancer and OTCD datasets. By looking at 
each pair of mutation and crossover setting, there have been some performances effect to the accuracy score 
to the three of datasets (Heart, Breastcancer, Diabetic). Slightly improvement also can be seen on the OTCD 
dataset. Both higher crossover rates in MCRI1 (crossover rate 0.9) and MCR3 (crossover rate 0.7) have 
achieved higher accuracy score compared to the lower crossover rate in MCR2 (crossover rate 0.1) and 
MCR4 (crossover rate 0.3). With the moderate setting of MCRS (0.5,0.5), none of single pattern can be 
observed from all the datasets. 

Figure 3 presents the accuracy scores of AML TPOT with different sets of mutation and crossover 
on the bigger size of datasets. Slightly different with the small datasets, higher crossover rates in MCRI and 
MCR3 did not present a great improvement to the accuracy scores of all datasets. In fact, slightly higher 
accuracy score can be seen on HREA, AI and Dota with MRC4 (0.3 crossover rate). Nevertheless, with the 
very small improvement, the default setting of MCR1 (0.1,0.9) can be considered as the best setting for the 
AML TPOT. 
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Figure 2. Accuracy score for small size datasets 
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Figure 3. Accuracy score for large datasizes 


Based on the results presented from the above Figure 2 and Figure 3, the general conclusion that can 
be made is that crossover with the exploitation aspect can be set to a higher value than mutation rates so that 
the search direction can go deep for better intensity in finding the best machine learning pipelines. 
However, for the case of big datasizes, futher empirical experiments with different factors might be essential 
to be conducted. 


4.2. Time to complete (TTC) 

The TTCs in second that the algoritm has taken for each mutation and crossover rates are presented 
in Figure 4 and Figure 5. As the lower crossover rates should take smaller TTC, it is interesting to get ideas 
on how the different datasets may affect on the TTC. Based on the theory, MCR2 or MCR4 with lower 
crossover rates should able to complete faster than the MCR1 and MCR3. 

As presented in Figure 4, only Breastcancer and Diabetic datasets that support the theory. MCR2 
and MCR4 with lower crossover rates (0.1 and 0.3) has lower TTC than MCRI and MCR3 (0.9 and 0.7). 
On the Heart and OTCD, lower crossover rate in MCR4 faced the lower TTC but found to be similar TTC 
values on MCR1 and MCR2. For all datasets, moderate crossover rate 0.5 is the lowest TTC. Major effect of 
lower crossover rates to the TTCs on all big datasets can be seen in Figure 5. 
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Figure 4. TTC for small size datasets 
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Figure 5. TTC for large size datasizes 


On all large size datasets, MCR2 (crossover rate 0.1) and MCR4 (crossover rate 0.3) have lower 
TTCs compared to MCRI1 (crossover rate 0.9) and MRC3 (crossover rate 0.7). On the contrary with small 
datasets, the moderate setting in MCRS did not have a pattern of effect the the TTCs. 

Based on Figure 4 and Figure 5, it can be generally concluded that lower crossover rates encourage 
the GP optimization search to stop the process at earlier stage mainly on the big datasize. 


5. CONCLUSION 

This paper presents the theory, design of experiments and findings related to the effect of 
parameterizations of genetic programming to an automated machine learning. The genetic programming is a 
kind of meta-heuristics used as the search optimization of finding the best machine learning pipelines. 
Designing the best machine learning is a difficult endeavour and costly mainly for new data scientists. With 
the automated machine learning, the best setting of machine learning model can be effectively conducted. 
However, although with automation scheme, researchers keep on questioning the hyper-parameters setting of 
automated machine learning, in this case the role of crossover and mutation rates. Rather than using the 
default setting, it will be useful if the researhers is provided with the theory and reasonable finding for the 
parameterizations of AML TPOT. As for future works, it will be interesting to study more extensively on the 
nature of datasets, featurizations to the performances of AML TPOT and the implementation of parallel 
computing platform in executing the big datasets. 
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